Wine Quality by Marc Collado

Abstract

Wine manufacturing is a magnificent craft, but its elaboration process, on which the final taste depends heavily, remains unknown to most of us, even to those born and raised in a country with a great wine tradition such as Spain, the second largest wine producer in the world, right after Italy. I was captivated from the first moment by this data set, which provides interesting data on such a beautiful craft.

In a nutshell, wine can be considered grape juice, but with one important difference: yeast. Yeasts are single-celled microorganisms classified as members of the fungus kingdom, and they are a key part of the elaboration process of wine.

In the absence of oxygen, yeast converts the sugars of wine grapes into alcohol and carbon dioxide through the process of fermentation. The result of this chemical reaction, as well as the type of grape and yeast used, will determine the final taste of the wine. On the other hand, mistakes or poor control over these organisms will cause wine faults, such as volatile acidity and Brettanomyces. These faults can affect the taste of the wine, or even make it undrinkable when they are present in such excess that they overwhelm its other components.

The way something as deterministic as a chemical reaction can have an impact on one of the most subjective human areas, such as taste, is in itself a fascinating topic. The exploration of this dataset will help us create the link (if any) between the components of the wine and its final taste. In other words, it will help us uncover the success recipe for such a crafty endeavour.

Exploring the Data Set

Before starting with a deeper data visualization, let’s explore the available data through some summaries and create univariate plots to understand the structure of the individual variables in the dataset.

## [1] 1599   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

The data set consists of 1599 observations and 12 variables (13 if you count X, which is merely a row index). Eleven of them are input variables, measurements from each wine sample. The last one is the output variable, a score between 0 and 10 that measures the wine quality as determined by the critics.

Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  1. quality (score between 0 and 10)

The main feature of the dataset is the quality of the wine. The main goal will be to determine which quantitative features (1 to 11) end up influencing the quality of the wine.


Here’s the breakdown of the data for each variable. Let’s dive deep into each one:

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Acids

Acids are major wine constituents and contribute greatly to its taste. In fact, acids impart the sourness or tartness that is a fundamental feature in wine taste.

The most abundant of these acids arise in the grapes themselves and carry over into the wine. However, there are also some acids that arise as a result of the fermentation process from either yeast or bacteria.

Traditionally total acidity is divided into two groups, namely the volatile acids, such as acetic acid (2), and the nonvolatile or fixed acids, such as tartaric acid (1) or citric acid (3).

Fixed Acidity - Tartaric acid

From a winemaking perspective, this acid is the most important in wine due to the prominent role it plays in maintaining the chemical stability of the wine and its color and finally in influencing the taste of the finished wine.

Most fixed acids originate in the grapes rather than being produced by yeast during the fermentation process. Grapes also contain ascorbic acid (Vitamin C), but this is usually lost during the elaboration process.

Tartaric acid levels found in wine can vary greatly. For example, wines produced from cool climate grapes are high in acidity and thus taste sour. In contrast, warm climate grapes can be lower in acid. In general, one would expect to see values in a normal range of 1 to 10 g/L.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

As expected, the majority of data points range between 7 and 10 g/L. The minimum, at 4.6 g/L, is well above the lower bound the literature suggests, but it is also true that red wines usually have higher amounts of fixed acids.

Volatile acidity - Acetic acid

The most common fault among wine producers is, without question, acetic acid.

It belongs to the volatile acids, which are usually not found in grapes but originate as a byproduct of fermentation. In particular, acetic acid is a two-carbon organic acid produced in wine during or after the fermentation period.

It is the most volatile of the primary acids associated with wine and is responsible for the sour taste of vinegar. During fermentation, if the wine is exposed to oxygen, Acetobacter bacteria (present in the process) will convert the ethanol into acetic acid. This process is known as the ‘acetification’ of wine and is the primary process behind wine degradation into vinegar.

A taster’s sensitivity to acetic acid will vary, but most people can detect excessive amounts at around 0.6 g/L.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The acetic acid follows a seemingly normal distribution, similar to the one we saw previously with tartaric acid. As the research suggested, most of the values are found between 0.3 and 0.8 g/L, but interestingly enough, there are a lot of values drifting to the right, above the 0.6 g/L threshold.

Fixed acidity - Citric acid

Usually found in small quantities, citric acid can add ‘freshness’ and flavor to wines. While it is very common in citrus fruits, such as oranges or limes, citric acid is found only in very minute quantities in wine grapes. It often has a concentration about 1/20 that of tartaric acid, shown in the first histogram.

The citric acid most commonly found in wine comes from commercially produced acid supplements derived from fermenting sucrose solutions. These inexpensive supplements can be used by winemakers in acidification to boost the wine’s total acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The data confirms the research: the concentration of citric acid found in the wine is in the range of 1/20 that of tartaric acid. Most values show a low concentration, below 0.8 g/L, and a good portion of the wines have none at all.

Total acidity

Since all the acid measurements are expressed in the same unit (g/L), we can easily create a new variable that tracks the total acidity of each wine, adding up the three types of acid.

As expected, since the three values followed an almost perfect normal distribution, the result for the total.acid is also a normal distribution.
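As a minimal sketch of how such a derived variable could be built in R: the data frame name wine and the column name total.acid match the names used later in this analysis, but the three rows below are made-up illustrative values standing in for the real data.

```r
# Toy stand-in for the wine data frame; the values are made up for
# illustration and are not observations from the data set.
wine <- data.frame(
  fixed.acidity    = c(7.0, 8.0, 9.0),
  volatile.acidity = c(0.5, 0.6, 0.4),
  citric.acid      = c(0.1, 0.3, 0.5)
)

# All three columns are expressed in g/L, so they can be added directly.
wine$total.acid <- wine$fixed.acidity +
  wine$volatile.acidity +
  wine$citric.acid

summary(wine$total.acid)
```

On the real data, the same assignment would add one total.acid value per wine sample.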

Residual sugar

During the fermentation, which often takes between one and two weeks, the yeast converts most of the sugars in the grape juice into ethanol (alcohol) and carbon dioxide. Depending on the type of wine though, the fermentation might stop before all the sugars have been used.

The earlier we stop the process, the sweeter the wine will be due to a higher amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 g/L and wines with greater than 45 g/L are considered sweet.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The vast majority of observations can be found between 1.5 and 3 g/L, which by any measure are low sugar levels. This makes sense, since the wines in our data set are all red wines, a variety not regarded as sweet.

Chlorides

Let’s move on to the mineral elements found in grapes and wines. They are usually absorbed from the soil through the roots of the vine, and they are present mainly in the skin, seeds and cellular walls of the pulp of the grape.

The mineral composition of a wine reflects its particular origin and development, making it unique and identifiable, which is really cool. It contributes significantly to the wine’s sensory characteristics, affecting color, clearness, flavor and aroma.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

From the measurements of the chlorides in the wine, we can see that half of the wines contain between 0.07 and 0.09 g/L of chloride salts (the interquartile range), with a maximum of 0.611 g/L. These salts may play a key role in a potential salty taste of a wine.

Sulfur Dioxide

Free Sulfur Dioxide

Sulfur dioxide is mainly added as a preservative in the winemaking process. In small quantities, the presence of this chemical component is not a problem. In fact, it is also produced by the human body at a level of about 1000 mg per day. Therefore, consumption of food preserved with sulfites is generally not a problem, except for a few people who are deficient in the natural enzyme that breaks it down.

The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion, and as mentioned before, it prevents microbial growth, but also the oxidation of wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Total Sulfur Dioxide

The total sulfur dioxide accounts for the amount of free and bound forms of SO2. Sulfur compounds typically have low sensory thresholds, but in low concentrations SO2 is mostly undetectable in wine. At free SO2 concentrations over 50 ppm (roughly 50 mg/L), SO2 becomes evident in the nose and taste of wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

As the histogram shows, wines get their largest dose of sulfites from non-free compounds. To get a better view of the distribution of sulfurs (total vs. free), the next scatterplot shows each data point positioned according to their amount of total vs. free sulfurs.

The plot confirms that a higher amount of free SO2 usually goes along with a higher amount of total SO2. We can also observe a couple of outliers at the very edge of the graph that we’ll analyze later.
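The claim that most sulfites come from non-free compounds can be roughly checked using only the means from the two summaries above. This sketch assumes that bound SO2 is simply total minus free, which follows from the definition of total SO2 given earlier; the variable names are illustrative.

```r
# Means taken from the two summaries above (mg/L).
mean_total_so2 <- 46.47
mean_free_so2  <- 15.87

# Total = free + bound, and means are additive, so the mean bound SO2
# is the difference between the two means.
mean_bound_so2 <- mean_total_so2 - mean_free_so2

# Share of the average total SO2 that is in bound form, in percent.
bound_share <- 100 * mean_bound_so2 / mean_total_so2
round(c(bound = mean_bound_so2, share = bound_share), 2)
```

On these means, roughly two thirds of the average total SO2 is in bound form.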

Density

The density of wine is close to that of water, and it depends mainly on the percentage of alcohol and the sugar content.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

For the density we get an almost perfectly normal distribution curve, centered slightly below the density of water, mainly due to the presence of alcohol, which is less dense than water. In fact, very few wines are denser than water.

Alcohol

Here’s the distribution for alcohol:

Most of the values range between 9% and 11%, but let’s see how this correlates with density seen above.

Again we see this interesting pattern where higher levels of alcohol relate to lower density wines, which, as mentioned before, makes sense.

pH

As explored above, the measure for the amount of acidity in wine is known as total acidity - related to the newly created variable wine$total.acid. This variable refers to the test that yields the total of all acids present in the wine, while strength of acidity is measured according to pH on a scale from 0 (very acidic) to 14 (very basic). Most wines are between 3 and 4 on the pH scale.
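To make the pH scale more concrete, here is a small sketch based on the standard definition of pH as the negative base-10 logarithm of the hydrogen ion concentration (in mol/L). The helper name h_conc is hypothetical and only used for illustration.

```r
# pH = -log10([H+]), so the hydrogen ion concentration is 10^(-pH).
# h_conc is a hypothetical helper name used only in this sketch.
h_conc <- function(ph) 10^(-ph)

# Each pH unit corresponds to a tenfold change in concentration, so a
# wine at pH 3 has ten times the hydrogen ion concentration of one at pH 4.
ratio <- h_conc(3) / h_conc(4)
ratio
```

This is why seemingly small differences in pH between wines can correspond to noticeable differences in how strong the acidity is.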

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH gives a roughly normal curve, with most of the values landing between 3 and 3.75. All of these wines are acidic, but they sit toward the upper (less acidic) end of the typical wine range. A little bit of research tells us that red wines are usually less acidic than their sweeter or white counterparts, which helps explain why the data is slightly skewed to the right.

Quality

Finally, the only non-empirical variable: the one based on taste, on a scale from zero to ten, according to the verdict of the wine experts. Let’s see how good the wines we are about to study actually are:

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Interesting. The values only range from 3 to 8: there are no ratings below 3, but also none above 8. The large majority of wines fall into the 5-6 category.

Here’s the wine distribution by quality:

3: 10 (0.63%) 4: 53 (3.31%) 5: 681 (42.6%) 6: 638 (39.9%) 7: 199 (12.4%) 8: 18 (1.13%)
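These percentages can be derived directly from the counts in the table above. A minimal sketch in base R, where the variable names are illustrative:

```r
# Counts per quality rating, copied from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681,
                    "6" = 638, "7" = 199, "8" = 18)

# Share of each rating as a percentage of all 1599 wines.
quality_pct <- 100 * quality_counts / sum(quality_counts)
round(quality_pct, 2)

# Combined share of the two dominant ratings, 5 and 6.
sum(quality_pct[c("5", "6")])
```

With a full vector of ratings rather than counts, prop.table(table(x)) would give the same proportions.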

Bivariate Plots Section

Visualizing The Whole Data Set

To begin the bivariate plot section, we’ll map out all the data in a matrix to see if any patterns emerge.

  • At first sight there are no extremely strong (>0.7) correlations between the data variables, with the notable exceptions of:
      • Acids (as seen when aggregating the data for the three types of acids)
      • Acids <> pH
      • Acids <> density
      • Sulfurs
      • Density <> alcohol (which makes sense, since alcohol is less dense than water)
  • Quality also seems mostly correlated with acids and alcohol

However, the graphics derived from the ggpairs visualization are not that useful, because quality is treated as numerical rather than categorical. Turning quality into categorical data would fix this issue and make it possible to map quality against the other continuous variables, category by category.
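One way this conversion could look in base R is sketched below. The short vector of ratings is made up for illustration, and the choice of an ordered factor with levels 3 to 8 is an assumption based on the ratings observed in the data set.

```r
# Toy vector of ratings (illustrative values only, not the real data).
quality <- c(5, 6, 7, 5, 3, 8)

# Convert to an ordered factor so that plotting functions treat each
# rating as a discrete category while preserving the ranking.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

levels(quality_f)
table(quality_f)
```

Listing the levels explicitly keeps ratings with no observations (here, 4) in the table, so every plot shares the same set of categories.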

The following plot replicates the idea of a matrix for all the data, but segmenting for each quality rating.

  • Box plots are showing up in the upper right section with the new factor variable, when comparing continuous data vs. discrete quality.
  • Despite getting further segmentation, data on the bottom still does not reveal visible patterns at first sight. Deeper exploration is required.

Acid vs. Quality

Compare the effect different types of acid have on the overall quality of the wine.

The same data can also be displayed through a set of histograms.

  • There is no clear evidence that Fixed Acidity contributes to the overall quality of the wine.
  • The quality of the wine improves with lower quantities of Volatile Acidity. The highest rated wines share a lower quantity of Volatile Acidity.
  • The opposite is true of Citric Acid: the highest rated wines tend to have higher quantities of Citric Acid.

Sugar & Chlorides vs. Quality

Compare the effect sugar and chlorides have on the overall quality of the wine.

The same data can also be displayed through a set of histograms.

  • As seen in the histogram for Residual Sugar, most wines stand at around 3 g/L, but there’s no evidence that sugar plays a key role when it comes to assessing wine quality.
  • On the other hand, it is true that the highest rated wines share a low concentration of Chlorides. The data is really scattered among the mid-range wines, but the top rated ones definitely have lower concentrations.

Sulfurs vs. Quality

Compare the effect sulfurs (both free and total) have on the overall quality of the wine.

At first sight it is difficult to draw any conclusions regarding sulphates and wine quality. Let’s display the same data through a set of histograms.

  • While it is true that highly rated wines share small quantities of sulfur, the same is true of the worst rated wines.
  • On average higher levels of sulphates seem to relate to higher quality wines, but the data is full of outliers with extremely high samples that are not high quality at all.

Chlorides & Sulfurs vs. Quality

From the previous two sections, it is clear that both chlorides and sulfurs have a direct influence on wine quality. Let’s put those together.

Indeed this is one of the most interesting plots we’ve seen so far.

The highest rated wines contain higher concentrations of sulphates. Interestingly, this is a component added to preserve wine freshness, and it is supposedly not perceptible by humans (up to a certain threshold).

Also the concentration of chlorides remains low across the whole high range of quality wines.

Density & Alcohol vs. Quality

Compare the effect density and alcohol have on the overall quality of the wine.

As seen before, alcohol levels correlate negatively (-0.496) with density. The underlying reason is that alcohol is less dense than water.

  • Higher levels of alcohol tend to positively correlate with higher quality wines.
  • The byproduct of this last bullet is that lower densities tend to also positively correlate with higher quality wines.
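For reference, a correlation coefficient such as the one quoted above can be computed in R with cor(), which returns the Pearson coefficient by default. The sketch below uses five made-up alcohol and density values purely to show the mechanics, so it will not reproduce the -0.496 figure from the real data.

```r
# Made-up illustrative values, not observations from the data set.
alcohol <- c(9.0, 9.5, 10.5, 11.0, 12.5)               # % by volume
density <- c(0.9980, 0.9975, 0.9968, 0.9962, 0.9950)   # g/cm^3

# Pearson correlation coefficient (the default method of cor()).
r <- cor(alcohol, density)

# The same quantity written out: covariance scaled by both
# standard deviations.
r_manual <- cov(alcohol, density) / (sd(alcohol) * sd(density))

c(r = r, r_manual = r_manual)
```

A negative value means that higher alcohol goes along with lower density, and the closer the value is to -1, the tighter that relationship.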

pH vs. Acid

Although related, there is a subtle distinction between pH and acidity: pH refers to how strong the acid is, while acidity relates to the amount of acid present in the substance.

  • As predicted, pH negatively correlates with the total acidity. Therefore we can conclude that the amount of acidity affects the pH (remember lower values of pH signal stronger acids).

Histograms Evolved

All the variables presented in the data set are continuous in their own nature. One of the best ways to visualize distributions across a continuous set is using a histogram, like the ones in the first section.

Regardless, we can go deeper into each histogram now that quality is a discrete variable.

  • The distribution based on the citric acidity shows that highest quality wines are grouped at the left of the histogram.

  • Something similar occurs when the highest rated wines are plotted according to alcohol volume, as seen in the bivariate section.

Breaking It Down

Back to the matrix of plots seen in the Bivariate section, there are still unexplored plots that look interesting at first sight, but can be explored deeper by using categorical data for wine quality.

This set of plots is quite interesting. Based on the alcohol levels, which already showed a strong correlation with high quality, they map out the segments and distribution of the different wine qualities against a third metric.

  • High quality wines only exist in a small subset of chlorides, given high volumes of alcohol.
  • High quality wines exist within a wide range of citric acid.
  • Usually lower quality wines exist across a wider range of other variables.

Strong correlations: Acidity, Density and pH

The data also shows that some of the strongest correlations can be found between these variables, especially between the pairs:

  • Acidity <> Density: 0.676
  • Acidity <> pH: -0.683

The latter has already been explored in the previous section (pH vs. Acid), but Acidity vs. Density comes as a surprise.

  • Correlation depends on the type of acid: volatile acid does not seem to relate much with density.
  • On the other hand, citric acid and fixed acid play a key role when it comes to determining density. Not only that, but higher quality wines show higher values of these acids, as well as lower densities.

Conclusions - Bivariate Analysis

After going deep into the data set, several insights surfaced:

  • There are no extremely strong (>0.7) correlations between quality and the other variables; the notable strong relationships are among some acids, pH, density (and alcohol) and sulfurs.
  • The density of the wine is clearly correlated (-0.496) with the amount of alcohol present in the solution, although the acidity pairs (0.676 with density, -0.683 with pH) are the strongest correlations of the set.
  • Quality seems mostly correlated with acids and alcohol.
  • The quality of the wine clearly improves with lower quantities of Volatile Acidity: the highest rated wines share a lower quantity of Volatile Acidity.
  • The opposite is true with Citric Acid, the highest rated wines tend to have higher quantities of Citric Acid.
  • The highest rated wines share a low concentration of Chlorides.

Multivariate Plots Section

Simplifying the Data

After analyzing the quality from every possible angle, in order to create more engaging visualizations, it’d be useful to group the quality data in three distinct sets:

  • Good: 3, 4 and 5 rating
  • Better: 6 rating
  • Best: 7 and 8 rating

With these defined segments, clearer versions of previous plots can be created in order to derive conclusions about how other variables affect quality.
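A minimal sketch of this grouping in base R, using cut(). The vector of ratings is rebuilt from the quality counts reported earlier in this analysis; the variable names and the break points passed to cut() are illustrative choices.

```r
# Rebuild a vector of ratings from the quality counts reported earlier
# (10, 53, 681, 638, 199 and 18 wines rated 3 to 8).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# cut() uses right-closed intervals by default, so the breaks below
# give (2,5] = 3, 4, 5; (5,6] = 6; (6,8] = 7, 8.
quality_group <- cut(quality,
                     breaks = c(2, 5, 6, 8),
                     labels = c("Good", "Better", "Best"))

table(quality_group)
```

The resulting factor can then be used to color or facet any of the previous plots by segment.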

Quality vs. Alcohol

This segmentation clearly shows how the larger portion of Good and Better wines have lower quantities of alcohol, while higher quantities of alcohol are a defining characteristic of the Best wines.

Quality vs. Acid

The same pattern emerges with both acetic and citric acid. Despite these being a small part of the total acid composition, all the Best wines share:

  • Low concentration of volatile acidity.
  • High concentration of citric acid.

Acid vs. Density

Yet another strong correlation remains unexplored (0.676). So far we already know several things about how these variables relate to each other:

  • pH is strongly determined by acidity.
  • Alcohol determines the density of the solution.
  • Best wines do have higher levels of alcohol.
  • Best wines are influenced by some acids, such as volatile or citric.

Acidity shows a clear correlation with density, but unfortunately, it does not seem to have an impact on wine quality, since the data points look really scattered in the plot.

Keeping The Best

Most relations across the different variables of the data set have been revealed already. Let’s focus now only on the Best and lay them down in a series of plots to better understand their properties.

Let’s create the same matrix of plots with ggpairs, but this time, only with the Best wines.

Besides the usual pairs (acids, sulfurs…) that are obviously always correlated, the relations between the variables are weaker than in the whole data set. Let’s figure out how they relate to quality, which was purposely left out of this last visualization.

Although the visualization is not ideal unless the graphic is properly reshaped, the data is really interesting:

Among the best wines, those with a rating of 8 show more consistent, tightly grouped data. pH, alcohol, density, some types of acids and chlorides are all tightly grouped within this segment.

Conclusions - Multivariate

This last section has served to reveal even more hidden relationships in the data that were not visible at first sight.

First of all, segmenting the data using three classes clearly showed how the larger portion of Good and Better wines have lower quantities of alcohol, while higher quantities of alcohol are a defining characteristic of the Best wines.

It also showed how acids impact the overall quality of the wine, although not all of them do, just citric and volatile acid.

Moreover, it unveiled a hidden relation between acids and density, which was visualized with a colorful scatterplot.

Final Plots

That was quite a journey of plots! It has unveiled a lot of hidden gems across the whole data set. The project set out looking for the secret recipe of “the Best” wine, and we’ve found a great deal of trailing indicators and characteristics among the Best.

These final plots will go deeper into this issue, presenting in the clearest possible way what makes a great wine.

The Role Alcohol Plays In The Best Wines

## Warning: Removed 6 rows containing missing values (geom_bar).

We have been consistently detecting a strong correlation between alcohol concentration and wine quality. The plot above reveals the distribution of wine samples according to their alcohol concentration, color-segmented by quality.

It shows really interesting data that, at first sight, could be counterintuitive.

  • Most wines analyzed have low concentrations of alcohol, between 9% and 11%.
  • The worst rated wines are also found in the left-most area of the chart, signaling lower concentrations of alcohol, between 9% and 10%.
  • The highest rated wines tend to have higher concentrations of alcohol, leaning more towards the right of the chart, from 11% up to 14%.
  • The area above 13.5% is reserved for top-notch wines, rated 6 or higher.

Curiously, alcohol, a really disgusting substance on its own, helps wine gain the love of the critics. Supposedly, more alcohol (up to a certain threshold) gets you a stronger red wine, which presumably is appreciated.

Higher concentrations of alcohol, as seen in previous plots, also contribute to lower densities, which might play a role when it comes to taste, but we can’t assess whether this is the cause or the consequence.

The Role Volatile Acidity and pH Play In The Best Wines

Following up from the first plot, this is one of the most telling and fascinating visualizations this project has seen so far.

It blends two of the most important (and hidden) factors that all great wines share, and it is also linked to second-order effects, such as the levels of alcohol we’ve seen before.

  • Highest quality wines almost invariably show low levels of acetic acid.
  • Most high quality wines are grouped between pH levels of 3 and 3.4.

Interestingly enough, acetic acid plays a minor role within the acid composition of any wine, usually lower than 10%. Moreover, wine quality has a positive (weak) correlation with total acidity.

Therefore, it is fascinating that acetic acid has a massive (and reverse to the main acid trend) impact on wine quality.

But it gets better: the more acid, the lower the pH. Although they measure different properties (amount vs. strength), and although less acetic acid means a better wine, great wines are also defined by lower levels of pH.

As a companion graphic, and aware that this section consists of only three plots, here’s a replica of the same plot with the data points in the middle removed, just to magnify the effect acetic acid has on quality.

This plot dissects the data even further and draws a clearer separation between the Good and Best wines, creating almost a line at 0.6 g/L of acetic acid and pH 3.3.

The Role Chlorides and Sulphates Play In The Best Wines

For this final plot, I wanted to do something a little bit different and went the other way around. I went out and researched what the most defining characteristic of a wine is, or at least, which components have an effect on its quality.

Quick research on the Internet shows that volatile acid (already studied), sulphates (added to the wine) and chlorides can be deal breakers when it comes to wine quality.

Indeed this is one of the most interesting plots we’ve seen so far. As noted in previous sections of the study, Best wines contain higher concentrations of sulphates. Interestingly, this is a component added to preserve wine freshness, and it is supposedly not perceptible by humans (up to a certain threshold).

The plot clearly groups almost all the best wines in the upper left quadrant, with higher concentrations of sulphates, but also lower concentrations of chlorides.

This plot was included as the final plot not just because of this backwards research, but because it is a vivid reflection of the “art” of wine making. Chlorides, for example, are usually absorbed from the soil through the roots of the vine. They are present mainly in the skin, seeds and cellular walls of the grape. Therefore, it is really difficult to control their levels from a manufacturing point of view.

Moreover, as seen in a previous section, the mineral composition of a wine reflects its particular origin and development, making it unique and identifiable.

This plot tells us how some of the variables can be engineered for taste when it comes to wine making, while others are left to craft, experience and, maybe, luck.


Reflection

It has been quite a journey along the wine data set.

It started with simple histograms of the data distribution. The continuous nature of the data made it really simple to visualize each variable as a histogram. Some findings were already revealed there, but they told us nothing about the link between the components of the wine and its final taste.

We moved to bivariate plots, segmenting the quality by categorical data, which allowed for box plots, but also histograms with quality separation. This section helped understand how quality was affected by each component.

Finally, all of this led to these three final plots. They were specially picked because they signaled specific features of the Best wines. Now we know that there are some factors that can directly influence wine quality, such as:

  • Higher concentrations of alcohol.
  • Low levels of volatile (acetic) acid, with pH between 3 and 3.4.
  • Higher concentrations of sulphates together with low concentrations of chlorides.

The ultimate purpose of the exercise was to understand “what made a great wine”. Above are the best answers I could find. I’m sure more factors influence the taste of wine, but if there is something I learned, it is that wine making is not a process that can be fully engineered. It retains some of the art, craft and experience of the maker. Some factors can’t be measured, and I suppose this is the beauty of the industry, and the most amazing finding of this journey.

Beyond that, as a personal note, I knew nothing about wines (apart from the fact that I liked them) before starting this journey. I live in the second largest wine-producing country in the world, but remained clueless when it came to wine manufacturing.

Most people talk about good and bad wines, but I doubt most of them know what intrinsically makes a wine “good or bad”. This data set has served as a guide to unveil really surprising relationships across the data and truly understand how wine is made.

But there have been some struggles along the way as well.

First of all, the data was difficult to analyze because there were no strong correlations (>0.7) from the get go. The last section has served to reveal hidden relationships in the data, but they were not easy to find at first sight. It required a lot of “trial and error”.

Segmenting the data using factors, and later three separate classes, also helped to visualize and understand the data. Having quality as a continuous variable was a pain at the beginning, because building box plots or separating scatterplots by color was not possible.

But beyond some struggles, this exercise helped me to tackle a data set, understand it from different angles and, most importantly, derive important conclusions from raw data, an ability I definitely didn’t have a few weeks ago.